@tolgacangoz commented Jul 5, 2025

This PR proposes to fix #188

Save the Weights & Biases run ID to the checkpoint file during training and load it when resuming from a checkpoint. This ensures logging continues in the same run, preventing the creation of new runs upon job restart.
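A minimal sketch of the save/load half of this idea, assuming a JSON checkpoint file for illustration. The helper names `save_checkpoint` and `load_wandb_run_id` and the file layout are hypothetical, not the library's checkpointer API. The loaded ID could then be passed to `wandb.init` as `id` together with `resume="allow"`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, step: int, wandb_run_id: Optional[str]) -> None:
    """Write training state plus the tracker's run ID (hypothetical layout)."""
    state = {"step": step, "wandb_run_id": wandb_run_id}
    path.write_text(json.dumps(state))


def load_wandb_run_id(path: Path) -> Optional[str]:
    """Return the stored run ID, or None for a fresh start or an old checkpoint."""
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("wandb_run_id")


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    save_checkpoint(ckpt, step=100, wandb_run_id="abc123")
    restored_id = load_wandb_run_id(ckpt)
```

Returning `None` when the key is missing keeps checkpoints written before this change loadable; they would simply start a new run.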

I am new to this library, so this PR is open for any suggestions and simplifications.

@sayakpaul @a-r-r-o-w

Saves the Weights & Biases run ID to the checkpoint file during training.

When resuming from a checkpoint, this ID is loaded and used to initialize the W&B tracker, ensuring that logging continues in the same run. This prevents the creation of new, separate runs when a job is restarted.
@tolgacangoz changed the title Propose to fix wandb session not re-used when resume_from_checkpoint is used Propose to fix wandb session not re-used when resume_from_checkpoint is used Jul 5, 2025
Adds a comprehensive test suite to verify that wandb runs can be correctly resumed from a saved checkpoint. This prevents the creation of a new wandb run upon resumption, ensuring a continuous experiment history.

The tests cover the following scenarios:
- The core logic of resuming a run using a `resume_run_id`.
- Verification that both `PTDCheckpointer` and `AccelerateCheckpointer` save the `wandb_run_id`.
- The end-to-end resumption flow for `SFTTrainer` and `ControlTrainer`.
- Introspection checks to confirm trainers include the necessary logic to extract and use the run ID from a checkpoint.
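The first scenario in the list could be checked roughly as follows. This is a sketch only: `init_tracker` is a hypothetical helper, not the PR's code, and a `MagicMock` stands in for the `wandb` module so nothing touches the network. The `id` and `resume` keyword arguments are the ones `wandb.init` accepts for resuming a run.

```python
from unittest import mock


def init_tracker(wandb_module, project, resume_run_id=None):
    """Hypothetical helper: forward a resume ID to wandb.init when one is given."""
    kwargs = {"project": project}
    if resume_run_id is not None:
        # Ask W&B to continue the existing run instead of creating a new one.
        kwargs.update(id=resume_run_id, resume="allow")
    return wandb_module.init(**kwargs)


# Resuming: the stored ID must be forwarded.
fake_wandb = mock.MagicMock()
init_tracker(fake_wandb, "demo-project", resume_run_id="abc123")
fake_wandb.init.assert_called_once_with(
    project="demo-project", id="abc123", resume="allow"
)

# Fresh start: no ID, so no resume arguments are passed.
fresh_wandb = mock.MagicMock()
init_tracker(fresh_wandb, "demo-project")
fresh_wandb.init.assert_called_once_with(project="demo-project")
```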

Fixes huggingface#188
Adds comprehensive regression tests to reproduce the wandb run resumption failure reported in issue huggingface#188.

The new tests simulate a full training lifecycle:
1. Start a training run and log metrics with the `WandbTracker`.
2. Save a checkpoint partway through.
3. Stop the initial run.
4. Start a new session and load the checkpoint.
5. Initialize a new `WandbTracker` using the run ID from the checkpoint.

The tests assert that the resumed tracker uses the original wandb run ID, rather than creating a new run. Separate tests are included for both the `AccelerateCheckpointer` and `PTDCheckpointer` to ensure the bug is captured for both implementations.
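The five steps above can be sketched with a stand-in tracker. `FakeWandbTracker` and the JSON checkpoint are assumptions made for illustration; the real tests use `WandbTracker` with `AccelerateCheckpointer` and `PTDCheckpointer`, whose interfaces are not shown in this thread.

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import Optional


class FakeWandbTracker:
    """Hypothetical stand-in for WandbTracker; makes no network calls."""

    def __init__(self, resume_run_id: Optional[str] = None) -> None:
        # Reuse the given ID when resuming, otherwise mint a fresh one.
        self.run_id = resume_run_id or uuid.uuid4().hex[:8]
        self.logged = []
        self.finished = False

    def log(self, metrics: dict, step: int) -> None:
        self.logged.append((step, metrics))

    def finish(self) -> None:
        self.finished = True


def run_lifecycle(checkpoint_dir: Path) -> tuple:
    ckpt = checkpoint_dir / "state.json"

    # 1. Start a training run and log metrics.
    first = FakeWandbTracker()
    first.log({"loss": 1.0}, step=1)

    # 2. Save a checkpoint partway through, including the run ID.
    ckpt.write_text(json.dumps({"step": 1, "wandb_run_id": first.run_id}))

    # 3. Stop the initial run.
    first.finish()

    # 4. Start a new session and load the checkpoint.
    state = json.loads(ckpt.read_text())

    # 5. Initialize a new tracker using the run ID from the checkpoint.
    resumed = FakeWandbTracker(resume_run_id=state.get("wandb_run_id"))
    resumed.log({"loss": 0.8}, step=state["step"] + 1)
    return first.run_id, resumed.run_id


with tempfile.TemporaryDirectory() as tmp:
    original_id, resumed_id = run_lifecycle(Path(tmp))

# The resumed tracker must report the original run ID, not a new one.
assert resumed_id == original_id
```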

Fixes huggingface#188
Introduces a new integration test to verify that the WandB session is correctly resumed when training continues from a saved checkpoint.

This ensures that experiment tracking data is consolidated into a single WandB run across multiple training sessions, rather than creating a new run upon each resumption.
…int argument type in SFTTrainerLoRAWandbResumeTests
@tolgacangoz deleted the fix-wandb-resuming branch October 30, 2025 14:43